False, a correct model may have large sampling variability
False, many nuisance variables create higher sampling variability
False, \(R^2\) can be inflated with additional X variables, even if they do not add to the model.
True, since all of these criteria use the SSE in their calculation, either directly or through its logarithm.
True, \(SSE_p\) equals \(PRESS_p\) only when every diagonal element of the hat matrix is zero.
True, since \(BIC\) applies a heavier penalty, \(p\log(n)\), whereas \(AIC\) uses only \(2p\).
True, since stepwise regression will pick the model with the optimal value of the criterion (usually the lowest).
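For reference (a standard parameterization; notation as in the answers above), the SSE-based criteria for a subset model with \(p\) parameters fit to \(n\) observations, with residuals \(e_i\) and hat-matrix diagonals \(h_{ii}\), are:

$$
\begin{aligned}
C_p &= \frac{SSE_p}{MSE_{\text{full}}} - (n - 2p), \\
AIC_p &= n\log\!\left(\frac{SSE_p}{n}\right) + 2p, \\
BIC_p &= n\log\!\left(\frac{SSE_p}{n}\right) + p\log(n), \\
PRESS_p &= \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^{2}.
\end{aligned}
$$

The BIC penalty \(p\log(n)\) exceeds AIC's \(2p\) whenever \(n > e^{2} \approx 7.4\), and \(PRESS_p = SSE_p\) exactly when every \(h_{ii} = 0\).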
## SSE.mean
## [1,] 34.876531 7.00
## [2,] 34.621515 6.75
## [3,] 18.712425 6.50
## [4,] 7.229235 6.00
## [5,] 5.559057 5.50
## [6,] 5.013139 5.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.377e-06 2.141e-05 3.467e-05 4.587e-05 5.618e-05 2.348e-04
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 2.616e-17 8.674e-18 4.441e-16
## [,1] [,2]
## [1,] 2.058886 2
## [2,] 3.075914 3
## [3,] 4.006368 4
## [4,] 5.890989 6
## [5,] 7.951301 8
## [6,] 10.045522 10
a.) Yes, there is a correct model: the one with the least model bias, which, as the plot shows, is a model of high order.
plot(l.order, variance.ave, type = 'o', xlab = "order of model", ylab = "variance", main = "model variance")
The model variance increases as we fit more complex, higher-order models. We can also see that as the model variance increases, the error variance decreases.
plot(l.order, bias2.ave, type = 'o', xlab = "order of model", ylab = "bias^2", main = "squared model bias")
The model bias behaves opposite to the model variance: it decreases as model complexity increases. This parallels the error variance, as both decrease as model complexity increases.
Since the error variance behaves much like the bias, the bias is the dominant component; so yes, the answer does depend on the error variance.
plot(
l.order,
err2.mean.ave,
type = 'o',
xlab = "order of model",
ylab = "mean-squared-estimation-error",
main = "mean-squared-estimation-error"
)
According to the mean-squared-estimation-error, the best model is the highest-order model with the most terms. The answer does not depend on the error variance, but it does depend on the model variance.
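The trade-off just described is the standard decomposition of mean-squared estimation error; at each point \(x\),

$$
E\!\left[\big(\hat{f}(x) - f(x)\big)^{2}\right]
= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^{2}}_{\text{squared model bias}}
+ \underbrace{\operatorname{Var}\!\big(\hat{f}(x)\big)}_{\text{model variance}},
$$

so the curve plotted here should be, up to simulation noise, the sum of the squared-bias and variance curves above.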
The \(E(SSE)\) decreases as the order of the model increases. As the noise increases, the expected value of the SSE grows larger in each simulation.
plot(
l.order,
SSE.mean / (n - l.order - 1),
type = 'o',
xlab = "order of model",
ylab = "E(MSE)",
main = "E(MSE)"
)
abline(h = sigma.e ^ 2, lty = 2, col = 2)
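The dashed reference line marks \(\sigma_\epsilon^{2}\), using the fact that for a model with no bias

$$
E[MSE] = E\!\left[\frac{SSE}{\,n - p - 1\,}\right] = \sigma_\epsilon^{2},
$$

so the E(MSE) curve should approach the dashed line once the model order is high enough to remove the bias.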
# Loading in the data
diab <- read.table("diabetes.txt", header = T)
diab
any(is.na(diab$frame))
## [1] FALSE
diab$frame[which(diab$frame == "")] <- NA
any(is.na(diab$frame))
## [1] TRUE
drops <- c("id", "bp.2s", "bp.2d")
diab <- diab[, !(names(diab) %in% drops)]
str(diab)
## 'data.frame': 403 obs. of 16 variables:
## $ chol : int 203 165 228 78 249 248 195 227 177 263 ...
## $ stab.glu: int 82 97 92 93 90 94 92 75 87 89 ...
## $ hdl : int 56 24 37 12 28 69 41 44 49 40 ...
## $ ratio : num 3.6 6.9 6.2 6.5 8.9 ...
## $ glyhb : num 4.31 4.44 4.64 4.63 7.72 ...
## $ location: chr "Buckingham" "Buckingham" "Buckingham" "Buckingham" ...
## $ age : int 46 29 58 67 64 34 30 37 45 55 ...
## $ gender : chr "female" "female" "female" "male" ...
## $ height : int 62 64 61 67 68 71 69 59 69 63 ...
## $ weight : int 121 218 256 119 183 190 191 170 166 202 ...
## $ frame : chr "medium" "large" "large" "large" ...
## $ bp.1s : int 118 112 190 110 138 132 161 NA 160 108 ...
## $ bp.1d : int 59 68 92 50 80 86 112 NA 80 72 ...
## $ waist : int 29 46 49 33 44 36 46 34 34 45 ...
## $ hip : int 38 48 57 38 41 42 49 39 40 50 ...
## $ time.ppn: int 720 360 180 480 300 195 720 1020 300 240 ...
The variables chol, stab.glu, hdl, ratio, glyhb, age, height, weight, bp.1s, bp.1d, waist, hip, and time.ppn are quantitative. The variables location, gender, and frame are qualitative.
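This classification can be checked programmatically; a quick sketch (assuming the cleaned `diab` data frame from above):

```r
# Numeric columns are the quantitative variables;
# character columns are the qualitative ones
quant <- names(diab)[sapply(diab, is.numeric)]
qual  <- names(diab)[sapply(diab, is.character)]
quant
qual
```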
library(ggplot2)
ggplot(data = diab, aes(x = glyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).
The distribution of glyhb is strongly right-skewed, with many values stretching into a long right tail.
library(tidyr)
ggplot(gather(diab[, c(
"chol",
"stab.glu",
"hdl",
"ratio"
)], cols, value), aes(x = value)) +
geom_histogram(binwidth = 20) + facet_grid(. ~ cols)
## Warning: Removed 3 rows containing non-finite values (stat_bin).
ggplot(gather(diab[, c(
"age",
"height",
"weight",
"bp.1s"
)], cols, value), aes(x = value)) +
geom_histogram(binwidth = 20) + facet_grid(. ~ cols)
## Warning: Removed 11 rows containing non-finite values (stat_bin).
ggplot(gather(diab[, c(
"bp.1d",
"waist",
"hip",
"time.ppn"
)], cols, value), aes(x = value)) +
geom_histogram(binwidth = 30) + facet_grid(. ~ cols)
## Warning: Removed 12 rows containing non-finite values (stat_bin).
pie(table(diab$location))
pie(table(diab$gender))
pie(table(diab$frame))
diab$logGlyhb <- log(diab$glyhb)
diab$recipGlyhb <- 1 / (diab$glyhb)
diab$sqrtGlyhb <- sqrt(diab$glyhb)
ggplot(data = diab, aes(x = logGlyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).
ggplot(data = diab, aes(x = recipGlyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).
ggplot(data = diab, aes(x = sqrtGlyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).
After trying several transformations, the reciprocal transformation \(\frac{1}{\text{glyhb}}\) appears the most normal.
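To make "appears the most normal" more concrete, we can compare sample skewness across the candidate transformations; a sketch using a hand-rolled skewness function so no extra package is needed (the value closest to zero is the most symmetric):

```r
# Sample skewness: the third standardized moment
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
c(original   = skew(diab$glyhb),
  log        = skew(diab$logGlyhb),
  reciprocal = skew(diab$recipGlyhb),
  sqrt       = skew(diab$sqrtGlyhb))
```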
diab$"glyhb*" <- diab$recipGlyhb
diab$glyhb <- diab$"glyhb*"
library(GGally)
library(plotly)
ggplotly(ggpairs(diab[,c("chol", "stab.glu", "hdl", "ratio", "age", "height", "weight", "bp.1s", "bp.1d", "waist", "hip","time.ppn")]))
From the scatterplot matrix, there are no signs of non-linearity in the data amongst the quantitative predictor variables.
ggplot(data = diab, aes(x = glyhb, color = gender)) + geom_boxplot()
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).
ggplot(data = diab, aes(x = glyhb, color = frame)) + geom_boxplot()
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).
The difference between genders is minimal. The difference in glycosylated hemoglobin across frames is more noticeable: large frames have the most spread and the lowest median, small frames have the highest median, and the medium frame has the most outliers.
set.seed(10)
library(caTools)
sample <- sample.split(diab$glyhb, SplitRatio = 0.5)
train <- subset(diab, sample == TRUE)
test <- subset(diab, sample == FALSE)
# simple way is to put labels on both datasets and remerge for comparisons
train$label <- rep("train", nrow(train))
test$label <- rep("test", nrow(test))
# Combined dataset now has labels for which is in train and which is in test/validation
combin <- rbind(train, test)
combin
ggplot(data = combin, aes(x = glyhb, color = label)) + geom_boxplot()
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).
ggplot(data = combin, aes(x = stab.glu, color = label)) + geom_boxplot()
ggplot(data = combin, aes(x = ratio, color = label)) + geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
ggplot(data = combin, aes(x = age, color = label)) + geom_boxplot()
ggplot(data = combin, aes(x = bp.1s, color = label)) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
ggplot(data = combin, aes(x = waist, color = label)) + geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
We can see from the boxplots that the difference between the training
and the validation data sets is very minimal. All these variables have
similar distributions.
diab <- diab[,c(-17, -18, -19, -20)]
diab
# Fitting a first order model
# lm(glyhb ~ chol + stab.glu + hdl + ratio + factor(location) + age + factor(gender) + height)
model1 <- lm(glyhb~., diab)
anova(model1)
There are 15 regression coefficients (16 including the intercept). The \(MSE = 0.00133\).
library(MASS)
boxcox(model1)
The Box-cox method shows that no further transformation of the response
variable is needed.
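For reference, Box-Cox searches over the power family of transformations

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[4pt]
\log y, & \lambda = 0,
\end{cases}
$$

and a confidence interval for \(\lambda\) that contains 1 indicates that the (already transformed) response needs no further transformation, which is what the plot shows here.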
library(leaps)
reg <- regsubsets(glyhb~., data = diab, nvmax = 16)
sumReg <- summary(reg)
sumReg
## Subset selection object
## Call: regsubsets.formula(glyhb ~ ., data = diab, nvmax = 16)
## 16 Variables (and intercept)
## Forced in Forced out
## chol FALSE FALSE
## stab.glu FALSE FALSE
## hdl FALSE FALSE
## ratio FALSE FALSE
## locationLouisa FALSE FALSE
## age FALSE FALSE
## gendermale FALSE FALSE
## height FALSE FALSE
## weight FALSE FALSE
## framemedium FALSE FALSE
## framesmall FALSE FALSE
## bp.1s FALSE FALSE
## bp.1d FALSE FALSE
## waist FALSE FALSE
## hip FALSE FALSE
## time.ppn FALSE FALSE
## 1 subsets of each size up to 16
## Selection Algorithm: exhaustive
## chol stab.glu hdl ratio locationLouisa age gendermale height weight
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " "*" " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " "*" " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " "*" " " "*" " " " " " "
## 6 ( 1 ) " " "*" " " "*" "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## framemedium framesmall bp.1s bp.1d waist hip time.ppn
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " " " "*" " " "*"
## 7 ( 1 ) " " " " " " " " "*" " " "*"
## 8 ( 1 ) "*" " " " " " " "*" " " "*"
## 9 ( 1 ) "*" " " "*" " " "*" " " "*"
## 10 ( 1 ) " " " " " " " " "*" "*" "*"
## 11 ( 1 ) "*" " " " " " " "*" "*" "*"
## 12 ( 1 ) "*" " " "*" " " "*" "*" "*"
## 13 ( 1 ) "*" " " "*" " " "*" "*" "*"
## 14 ( 1 ) "*" " " "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
# Dataframe for all the criterion (AIC is not printed by regsubset)
data.frame(
SSE = sumReg$rss,
R2 = sumReg$rsq,
R2adj = sumReg$adjr2,
Cp = sumReg$cp,
BIC = sumReg$bic
)
### Now we find which model had the optimum value
# SSE
which.min(sumReg$rss)
## [1] 16
# R2
which.max(sumReg$rsq)
## [1] 16
# R2adj
which.max(sumReg$adjr2)
## [1] 7
# Cp
which.min(sumReg$cp)
## [1] 7
# BIC
which.min(sumReg$bic)
## [1] 4
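The disagreement among the criteria is expected, since each penalizes model size differently. `regsubsets` does not report AIC, but it can be recovered from the RSS up to an additive constant; a sketch (assuming `n` is the number of complete cases actually used by `regsubsets`):

```r
# AIC_p = n*log(SSE_p/n) + 2*(p+1), up to an additive constant
n <- sum(complete.cases(diab))
p <- 1:16                       # predictors in each best subset
aic <- n * log(sumReg$rss / n) + 2 * (p + 1)
which.min(aic)
```

Because AIC's \(2p\) penalty matches Cp's asymptotically while BIC's \(p\log(n)\) penalty is heavier, AIC and Cp tend to select larger models (here 7 predictors) than BIC (here 4).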
diab <- na.omit(diab)
nullModel1 <- lm(glyhb~1, data = diab)
stepAICfor <- stepAIC(
nullModel1,
scope = list(upper = "~chol + stab.glu + hdl + ratio + factor(location) + age + factor(gender) +height+weight+factor(frame)+bp.1s+bp.1d+waist+hip+time.ppn",
lower = ~ 1), direction = "forward", k = 2, trace = FALSE
)
stepAICfor$anova
# Since Cp and AIC tend to agree, we pick the model chosen by Cp (the 7-predictor model) and fit it here
model7 <- lm(glyhb ~ chol + stab.glu + ratio + factor(location) + age + factor(frame) +
waist + time.ppn,
data = diab
)
AIC(model7)
## [1] -1379.226
# Pick the final model from forward selection
modelfs1 <- lm(stepAICfor$model)
# Directly compare the models
model7
##
## Call:
## lm(formula = glyhb ~ chol + stab.glu + ratio + factor(location) +
## age + factor(frame) + waist + time.ppn, data = diab)
##
## Coefficients:
## (Intercept) chol stab.glu
## 3.524e-01 -6.593e-05 -4.968e-04
## ratio factor(location)Louisa age
## -2.912e-03 6.846e-03 -6.341e-04
## factor(frame)medium factor(frame)small waist
## -3.853e-03 -1.350e-03 -1.133e-03
## time.ppn
## -1.180e-05
modelfs1
##
## Call:
## lm(formula = stepAICfor$model)
##
## Coefficients:
## (Intercept) stab.glu age
## 3.494e-01 -4.952e-04 -6.216e-04
## ratio waist time.ppn
## -2.910e-03 -1.093e-03 -1.177e-05
## factor(location)Louisa chol
## 6.523e-03 -7.209e-05
We can see that the regsubsets model picks the variables chol, stab.glu, ratio, location, age, frame, waist, and time.ppn. Forward selection picks a model with all of the same variables except frame. Let's see the difference in AICs.
AIC(model7)
## [1] -1379.226
AIC(modelfs1)
## [1] -1382.494
We can see that modelfs1 has the lower AIC.
plot(modelfs1)
We can see that the Residuals versus Fitted values shows random scatter, with no obvious signs of patterns. From the Normal Q-Q plot, we can see approximate normality in the residuals. Overall, the model appears to be adequate.
model2 <- lm(glyhb ~ .^2, data = diab)
aovModel2 <- anova(model2)
# Get the last value, which is the MSE
tail(aovModel2$`Mean Sq`, n = 1)
## [1] 0.001221709
My immediate concern is that this model contains an extremely large number of unnecessary variables, mostly interaction terms.
stepAICfor2 <- stepAIC(
nullModel1,
scope = list(lower = nullModel1, upper = model2),
k = 2,
direction = "forward",
trace = F
)
modelfs2 <- lm(stepAICfor2$model)
# Compare to first model
AIC(modelfs1)
## [1] -1382.494
AIC(modelfs2)
## [1] -1395.439
We notice that the model with interaction terms has a lower AIC than the model without interaction terms.
plot(modelfs2)
Both the residuals-versus-fitted plot and the normal Q-Q plot show that the model is adequate: there are no patterns in the residuals versus fitted values, and the normal Q-Q plot shows no deviation from approximate normality.
d.)
# We perform forward stepwise procedure using AIC
stepAIC(
nullModel1,
scope = list(lower = nullModel1, upper = modelfs2),
k = 2,
direction = "forward"
)
## Start: AIC=-2173.55
## glyhb ~ 1
##
## Df Sum of Sq RSS AIC
## + stab.glu 1 0.39753 0.56182 -2367.4
## + age 1 0.15016 0.80918 -2233.9
## + ratio 1 0.12108 0.83827 -2220.9
## + waist 1 0.09755 0.86180 -2210.8
## + location 1 0.00655 0.95280 -2174.1
## <none> 0.95934 -2173.6
## + time.ppn 1 0.00126 0.95809 -2172.0
##
## Step: AIC=-2367.39
## glyhb ~ stab.glu
##
## Df Sum of Sq RSS AIC
## + age 1 0.048669 0.51315 -2398.6
## + waist 1 0.028792 0.53303 -2384.6
## + ratio 1 0.027939 0.53388 -2384.1
## + location 1 0.005821 0.55600 -2369.2
## + time.ppn 1 0.004368 0.55745 -2368.2
## <none> 0.56182 -2367.4
##
## Step: AIC=-2398.55
## glyhb ~ stab.glu + age
##
## Df Sum of Sq RSS AIC
## + ratio 1 0.0214844 0.49167 -2412.2
## + waist 1 0.0212472 0.49190 -2412.0
## + stab.glu:age 1 0.0090485 0.50410 -2403.1
## + location 1 0.0052488 0.50790 -2400.3
## + time.ppn 1 0.0047858 0.50836 -2400.0
## <none> 0.51315 -2398.6
##
## Step: AIC=-2412.21
## glyhb ~ stab.glu + age + ratio
##
## Df Sum of Sq RSS AIC
## + waist 1 0.0125184 0.47915 -2419.7
## + stab.glu:age 1 0.0069134 0.48475 -2415.4
## + time.ppn 1 0.0056432 0.48602 -2414.4
## + location 1 0.0053102 0.48636 -2414.2
## <none> 0.49167 -2412.2
## + age:ratio 1 0.0004177 0.49125 -2410.5
##
## Step: AIC=-2419.65
## glyhb ~ stab.glu + age + ratio + waist
##
## Df Sum of Sq RSS AIC
## + time.ppn 1 0.0064607 0.47269 -2422.6
## + stab.glu:age 1 0.0051982 0.47395 -2421.6
## + location 1 0.0044587 0.47469 -2421.1
## <none> 0.47915 -2419.7
## + age:ratio 1 0.0008567 0.47829 -2418.3
##
## Step: AIC=-2422.62
## glyhb ~ stab.glu + age + ratio + waist + time.ppn
##
## Df Sum of Sq RSS AIC
## + stab.glu:time.ppn 1 0.0127963 0.45989 -2430.7
## + stab.glu:age 1 0.0064115 0.46627 -2425.6
## + location 1 0.0031451 0.46954 -2423.1
## <none> 0.47269 -2422.6
## + age:ratio 1 0.0011950 0.47149 -2421.5
##
## Step: AIC=-2430.66
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + stab.glu:time.ppn
##
## Df Sum of Sq RSS AIC
## + stab.glu:age 1 0.0068912 0.45300 -2434.2
## + location 1 0.0041805 0.45571 -2432.0
## <none> 0.45989 -2430.7
## + age:ratio 1 0.0016075 0.45828 -2429.9
##
## Step: AIC=-2434.19
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + stab.glu:time.ppn +
## stab.glu:age
##
## Df Sum of Sq RSS AIC
## + location 1 0.0045058 0.44849 -2435.8
## + age:ratio 1 0.0033804 0.44962 -2434.9
## <none> 0.45300 -2434.2
##
## Step: AIC=-2435.85
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + location +
## stab.glu:time.ppn + stab.glu:age
##
## Df Sum of Sq RSS AIC
## + age:ratio 1 0.0027568 0.44574 -2436.1
## <none> 0.44849 -2435.8
##
## Step: AIC=-2436.1
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + location +
## stab.glu:time.ppn + stab.glu:age + age:ratio
##
## Call:
## lm(formula = glyhb ~ stab.glu + age + ratio + waist + time.ppn +
## location + stab.glu:time.ppn + stab.glu:age + age:ratio,
## data = diab)
##
## Coefficients:
## (Intercept) stab.glu age ratio
## 3.355e-01 -7.876e-04 -8.332e-04 2.777e-03
## waist time.ppn locationLouisa stab.glu:time.ppn
## -1.034e-03 3.580e-05 6.642e-03 -4.619e-07
## stab.glu:age age:ratio
## 7.341e-06 -1.209e-04
Performing the forward stepwise procedure based on AIC, we again arrive at modelfs2, confirming that modelfs2 is the best choice for minimizing AIC.